
Conversation

tianyu-l (Contributor) commented:

Creating a new field in JobConfig, with the default being

[compile]
enable = false
components = ["model", "loss"]

This way we can compile the loss separately to get the memory reduction, even when the model is not yet ready to be compiled.

This PR also applies loss compilation to DeepSeek 16B and 671B.
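The component gating described above can be sketched roughly as follows. The `Compile` field names (`enable`, `components`) follow the PR description; `apply_compile`, `cross_entropy_loss`, and the call shape are illustrative stand-ins, not torchtitan's actual code.

```python
# Sketch of a components-based compile config, assuming the [compile]
# section from the PR description; helper names are hypothetical.
from dataclasses import dataclass, field

import torch
import torch.nn as nn
import torch.nn.functional as F


@dataclass
class Compile:
    enable: bool = False
    components: list[str] = field(default_factory=lambda: ["model", "loss"])


def cross_entropy_loss(pred: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # A typical LM loss; compiling it separately can fuse the float()
    # upcast and reduce peak memory even when the model is left eager.
    return F.cross_entropy(pred.flatten(0, 1).float(), labels.flatten(0, 1))


def apply_compile(model: nn.Module, loss_fn, compile_cfg: Compile):
    # Each component is compiled only if compilation is enabled AND the
    # component is listed, so loss can be compiled while the model is not.
    if compile_cfg.enable and "model" in compile_cfg.components:
        model = torch.compile(model)
    if compile_cfg.enable and "loss" in compile_cfg.components:
        loss_fn = torch.compile(loss_fn)
    return model, loss_fn
```

With the default config (`enable = false`) both objects pass through untouched; with `enable = true` and `components = ["loss"]` only the loss function is wrapped by `torch.compile`.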

meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Aug 20, 2025.
if parallel_dims.tp_enabled:
    if (
        job_config.parallelism.enable_async_tensor_parallel
        and not job_config.training.compile
        and not model_compile_enabled
    ):
        raise RuntimeError("Async TP requires --training.compile")
Contributor commented:

Need to fix the error message
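One possible shape for the fix, sketched under the assumption that the message should point at the new `[compile]` options rather than the old `--training.compile` flag; the helper name `check_async_tp` and the exact message wording are hypothetical.

```python
# Hypothetical sketch of the corrected guard, assuming the new [compile]
# section introduced by this PR; flag names in the message are illustrative.
def check_async_tp(async_tp_enabled: bool, model_compile_enabled: bool) -> None:
    # Async tensor parallel relies on the model being compiled, so it is
    # rejected whenever model compilation is turned off.
    if async_tp_enabled and not model_compile_enabled:
        raise RuntimeError(
            "Async TP requires model compilation: set --compile.enable "
            "and include 'model' in --compile.components"
        )
```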

if parallel_dims.tp_enabled:
    if (
        job_config.parallelism.enable_async_tensor_parallel
        and not job_config.training.compile
        and not model_compile_enabled
    ):
        raise RuntimeError("Async TP requires --training.compile")
Contributor commented:

Need to fix the error message as well

@wwwjn (Contributor) left a comment:

LGTM!

@tianyu-l merged commit 08b8b24 into main on Aug 21, 2025. 10 checks passed.
@tianyu-l deleted the compile branch on August 21, 2025 at 02:00.